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Abstract. In many inverse problems, model parameters cannot be precisely determined from observational 
data. Bayesian inference provides a mechanism for capturing the resulting parameter uncertainty, 
but typically at a high computational cost. This work introduces a multiscale decomposition that 
exploits conditional independence across scales, when present in certain classes of inverse prob¬ 
lems, to decouple Bayesian inference into two stages: (1) a computationally tractable coarse-scale 
inference problem; and (2) a mapping of the low-dimensional coarse-scale posterior distribution 
into the original high-dimensional parameter space. This decomposition relies on a character¬ 
ization of the non-Gaussian joint distribution of coarse- and fine-scale quantities via optimal 
transport maps. We demonstrate our approach on a sequence of inverse problems arising in sub¬ 
surface flow, using the multiscale finite element method to discretize the steady state pressure 
equation. We compare the multiscale strategy with full-dimensional Markov chain Monte Carlo 
on a problem of moderate dimension (100 parameters) and then use it to infer a conductivity 
field described by over 10 000 parameters. 

Key words. Bayesian inference, inverse problems, multiscale modeling, multiscale finite element method, 
optimal transportation, Markov chain Monte Carlo 

1. Introduction. Mathematical models often contain parameters that must be esti¬ 
mated from observational data, before the models can be used for prediction or design. 
Deterministic approaches have long been applied to such inverse problems (e.g., [4, 10, 14, 
20, 65]). Yet observations can seldom constrain model parameters precisely, and the re¬ 
sulting inverse problems are thus ill-posed. In this context, statistical approaches—e.g., 
the Bayesian approach [38, 58, 61]- provide a rigorous framework for simultaneously char¬ 
acterizing model parameters and their uncertainties [6]. This characterization is particu¬ 
larly crucial for applications requiring quantified uncertainties in model predictions (e.g., 
[19, 22, 62, 67]). 

Ill-posed inverse problems often result from the combination of high-dimensional pa¬ 
rameters and indirect observations, especially when the observations smooth or integrate 
the parameter held of interest. In this setting, the map from parameters to observations 
is called the forward model. Smoothing forward models are ubiquitous in science and 
engineering applications, ranging from how through porous media to tomography and re¬ 
mote sensing. A significant body of research has sought to develop and analyze multiscale 
methods for evaluating such forward models [1, 3, 17, 26, 29, 31, 35-37]. Many of these 
methods rely on the idea that a finite dimensional coarse-scale representation can include 
the impact of hne-scale structure in the parameters without resolving the problem to that 
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level. For example, in the context of partial differential equations (PDEs), fine-scale spatial 
variation in a coefficient or initial condition can be captured by homogenized coefficients, 
coarse-scale basis functions, or other correction terms to the variational statement of the 
problem, enabling the PDE solution to be accurately approximated at reduced cost. A host 
of theoretical developments and numerical approaches [30, 32, 53] have yielded multiscale 
solution strategies for linear and nonlinear PDEs, ODEs, and other systems [45]. 

Implicit in the success of a multiscale strategy for solving the forward model is a notion 
of conditional independence: observations of the solution can be directly predicted by some 
coarse-scale quantities; conditioned on these coarse-scale quantities, the observations and 
the original parameter held of interest (which resolves fine-scale structure) are independent. 
We will use this notion to design a multiscale Bayesian inference approach. Our approach 
will decompose the inverse problem into a coarse-scale inference problem, where coarse- 
scale quantities are inferred from data, and a fine-scale inference problem, where fine-scale 
parameters are conditioned on realizations of the coarse scale. This strategy will reduce the 
effective parameter dimension of the inverse problem, increase parameter identihability, and 
provide significant opportunities for parallel computation. Our framework will accommo¬ 
date nonlinear, non-deterministic, and non-Gaussian relationships between the coarse- and 
fine-scale quantities. The specific coarse-scale quantity that we employ is entirely flexible in 
that it is a consequence of the multiscale solution method chosen for the forward problem; 
our framework is thus, in principle, applicable to a variety of multiscale modeling methods 
and a broad array of problems containing multiscale structure. 

In the Bayesian setting, knowledge about the values and structure of the parameters is 
represented via the assignment of probabilities [23, 33, 34, 38, 58, 61]. The prior probability 
distribution represents an initial state of knowledge, which is updated via the likelihood 
function to obtain the posterior distribution. The likelihood often couples a physics-based 
forward model (e.g., a PDE) with a probabilistic description of observation and model errors. 
Except in simple cases, the posterior distribution cannot be characterized in closed form and 
one must instead resort to sampling approaches, e.g., Markov chain Monte Carlo (MCMC) 
[9, 41, 54], importance sampling [41], sequential Monte Carlo [16], or variational approaches 
including transport maps [47]. High-dimensional parameter spaces and computationally 
expensive models make all of these methods more challenging to apply. Significant research 
effort has thus been devoted to dimension reduction approaches that can make sampling 
more efficient, and reduce the number of forward model evaluations required. 

For example, in the Bayesian approach to inverse problems, Karhunen-Loeve (KL) ex¬ 
pansions based on the prior [18, 40, 43] have proven useful in reducing parameter dimension. 
But this approach is practically limited to rather smoothing priors, and does not account 
for the forward model and the data in identifying relevant parameter directions. More 
recent work has introduced the notion of a likelihood-informed subspace (LIS), which con¬ 
tains parameter directions where the posterior distribution is most different from the prior 
[12, 13, 56]. Construction of the LIS requires finding the dominant eigenmodes of the 
Hessian of the log-likelihood, preconditioned by the prior covariance, at many points in the 
parameter space. These approaches rely on a heuristic globalization of Hessian information, 
but yield only a linear subspace of the original parameter space. 

Other approaches to dimension reduction in Bayesian inversion explicitly take advantage 
of multiscale structure. Existing approaches generally do so in two ways: (1) by using 
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multiple discretizations of the parameter field itself [27, 28, 68], or (2) by employing efficient 
multiscale numerical solvers for the forward model. [18] and [15] are in this second category 
and are most related to the present work. These efforts perform inference on a single 
discretization of the parameter field, but use a multiscale forward solver to drive a “delayed- 
acceptance” MCMC scheme [11], where proposed samples are first accepted or rejected 
according to an approximation of the forward model solution. These approaches improve 
MCMC efficiency but still require, per accepted sample, at least one new forward solve 
that explicitly resolves the fine scales. [48, 49] analyze the impact of the homogenization 
of elliptic operators on parameter estimates. This analysis requires specific forms for the 
parameter fields, however, and focuses on maximum a posteriori (MAP) estimation rather 
than the full Bayesian posterior. 

With the exception of [48] and [49], the multiscale sampling approaches described above 
seek to accelerate asymptotically exact sampling of a posterior distribution on a fine-scale 
representation of the parameters. Repeated calls to a forward solver that resolves the fine 
scales then remain an “online” part of each inference algorithm—in the sense that these 
calls must be performed after observations are obtained. This requirement carries signif¬ 
icant computational expense. We will exploit multiscale structure in a different manner, 
approximating the posterior according to the conditional independence assumptions de¬ 
scribed earlier and reducing the online time required for posterior exploration. We generate 
a posterior on the coarse-scale representation and use the conditional distribution of the fine 
scales to “prolong” samples of the coarse-scale posterior to the original high-dimensional pa¬ 
rameter space containing fine-scale structure. In contrast with many previous approaches, 
we do not require particular forms for the forward model or prior distribution; we avoid 
globally resolving the fine scales in any forward solves; and we allow for rather general 
multiscale models whose coarse or fine features may not even lie on a mesh. 

To capture the stochastic and in general non-Gaussian relationship among the scales, 
we will rely on transport maps —deterministic functions that push forward a reference dis¬ 
tribution to a more complex target distribution. We will construct transport maps in a 
particular form that enables efficient marginalization and conditioning, both of which are 
basic operations for our multiscale inference framework. Importantly, these transport maps 
can be constructed a priori, before any data is observed, significantly reducing online com¬ 
putational effort. We will also take advantage of stationarity and locality, when present 
in the problem, to accelerate the construction of transport maps that represent the joint 
distribution of the fine and coarse scales. 

The remainder of this paper is organized as follows. Section 2 introduces a multiscale 
decomposition of the inverse problem in a Bayesian setting. Section 3 introduces transport 
maps and shows how they can be used to render the decomposition of Section 2 into an 
effective algorithm. Section 4 provides a small illustrative example, Section 5 describes 
strategies for applying the multiscale framework to elliptic PDE inverse problems, and Sec¬ 
tion 6 demonstrates the efficiency of our approach on two large subsurface flow applications. 

2. A framework for multiscale Bayesian inference. Let 6 be a random variable taking 
values in M de . In the discussion below, 6 will represent the parameter field we wish to infer, 
which may contain fine scale structure. For simplicity, we will assume that all probability 
distributions have densities with respect to Lebesgue measure. To keep notation straight¬ 
forward, we will also specify density functions by their arguments, except where this might 
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be ambiguous. 

In the Bayesian context, the prior probability density n(9) represents prior knowledge 
about the random variable 9. Observations are represented by an Revalued random vari¬ 
able y; conditioning on a particular value of these observations then yields a posterior 
probability density according to Bayes’ rule: 


( 2 . 1 ) 


vr(/9|y) 


n{y,Q) 

7T (y) 


7 r(y|( 9 ) 7 r( 6 >) 

7T (y) 


oc Tr(y\9)Tr(0). 


Here ir(y\9), viewed as a function of 6, is the likelihood function. The normalizing constant 
7 T(y) is called the evidence. In the Bayesian setting for inverse problems, the likelihood func¬ 
tion typically contains a deterministic forward model /, and compares f(9) to y according 
to a statistical model for observation and model errors [38, 58]. 

2.1. A statistical definition for multiscale models. Many systems contain parameter¬ 
ized features or dynamics that span multiple scales of spatial or temporal variation. Of 
course, this is a rather general observation. To be more precise, our framework employs a 
specific definition of “multiscale” structure. We will say that a model mapping 9 to y has 
multiscale structure if there is a quantity 7 such that y and 9 are conditionally independent 
given 7 , i.e.: 


( 2 . 2 ) 


Ti-(y |7,0) = Tr(y|7). 


Very often 7 will be naturally suggested by the forward model, and will represent a coarse- 
scale quantity derived from the parameters 9 and the forward operator. To make the 
introduction of the coarse-scale quantity useful, the dimension of 7 , denoted by d 7 , should 
be smaller than the dimension of 9. 

Even though this definition might seem abstract, many real systems exhibit behavior 
that approximately satisfies (2.2). For example, any deterministic model f{9) that can 
be written as f(9) = / (y(x)), where / : R ^ 7 —»• and g : R rfs — > R^ 7 , will yield a 
posterior that satisfies (2.2). In subsurface flow applications, for instance, 9 could be a 
conductivity field discretized on a mesh that resolves the fine scales and 7 = g(9) could be a 
coarse “upscaled” conductivity field, where / and / provide predictions of hydraulic head. 
Further examples exist in any application where observables depend on some aggregate, 
integrated, or homogenized behavior of the fine scale parameter 9. 

Of course, in many systems, the equality (2.2) is only approximately satisfied. With 
the resulting likelihood approximation ir(y\'y,9) ~ 7 r(y| 7 ), only approximate posterior sam¬ 
ples of 9 can be obtained. In applications where multiscale forward modeling has been 
successful, however, this approximation error can be quite small. Moreover, in practice, 
the computational advantages of using the coarse parameter 7 may greatly outweigh the 
drawbacks of a posterior approximation. As we will demonstrate in Section 6 . 2 , exploiting 
multiscale structure can allow us to tackle very large problems where directly sampling 
ir(9\y) is otherwise intractable. 

2.2. Algorithmic building blocks: decoupling the coarse and fine scales. In the usual 
“single-scale” setting, we obtain a posterior ir(9\y) according to Bayes’ rule as shown in 
(2.1). Let us instead consider the joint posterior distribution of the coarse- and fine-scale 
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parameters ( 0 , 7 ): 
(2.3) 


n(0, 7 |y) oc 7 r(y| 0 , 7 ) 77 ( 7 , 0 ) 

~ 7r(y|7) 7 r (7, 0 ) 

= vr(y|7)7r(7)7r(0|7). 

In moving from the first line to the second, we applied the conditional independence as¬ 
sumption (2.2). Then we expanded the joint prior 77 ( 7 , 0 ) into the marginal prior of the 
coarse quantity 7 and the conditional prior distribution of the fine scales given the coarse. 
This resulting expression is the foundation of our multiscale inference framework. The three 
densities on the right hand side of (2.3) can be understood as a coarse likelihood, a coarse 
prior, and a downscaling “prolongation” density. Notice that only the downscaling density 
involves the high-dimensional fine scale parameters 0 . 

It is trivial to remove 0 from (2.3) via marginalization, leaving the posterior density of 
the coarse parameters alone: 77(7 \y) oc 77 ( 7 / 17 ) 77 ( 7 ). We can now break sampling the fine- 
scale posterior 7r(0|y) into two steps: (1) coarse-scale inference, which samples the posterior 
77(7 \y) directly and ignores the fine scale parameters; and ( 2 ) fine-scale conditioning, which 
generates one or more samples of the fine-scale parameter from 7r(0|7^1), for each posterior 
sample 7 W. Since n(6\y,'y) = 77 ( 0 ) 7 ) under the conditional independence assumption, the 
combination of these two steps will generate samples from the joint posterior 7 r( 0 , 7 I 2 /) = 
7r(0|7, y)7r(7|y). Marginalizing out the coarse parameter (i.e., simply ignoring the coarse 
component of each joint sample) will produce samples of the fine-scale posterior 7r(0|y) = 
f tt(0 , 712/)<^7- 

While this two-step process is conceptually simple, two important issues remain: 

1. Sampling the coarse-scale posterior in principle requires evaluating the prior density 
77 ( 7 ) of the coarse parameter 7. But the original inference problem only specifies a 
prior on the fine-scale parameter 0 . 

2. Generating fine-scale posterior samples requires that we sample from the conditional 
density 77 ( 0 ( 7 ); in general cases, this task may be nontrivial. 

Both of these issues will be addressed through the construction of transport maps that 
represent the joint prior distribution of the coarse- and fine-scale quantities, with a particular 
structure described in the next section. 

3. Transport maps for multiscale inference. Transport maps are deterministic nonlin¬ 
ear variable transformations between (probability) measures [63, 64], Transport maps have 
recently been used to accelerate Bayesian inference, coupled with MCMC [52] or in a stan¬ 
dalone approach [47]. Compositions of many simple transport maps have also been used for 
density estimation in [59, 60]. Here we will use transport maps to transform the joint prior 
density 77 ( 7 ,0) into a standard normal distribution that can be easily sampled. Construct¬ 
ing this transformation with the appropriate structure will enable easy characterization of 
the coarse prior density 77 ( 7 ) and sampling from downscaling density 7 r( 0 | 7 ). 

3.1. Transport maps. To define a transport map, consider two Borel probability mea¬ 
sures on M rf , denoted by fx and v. We will call these the target and reference measures, 
respectively, and associate them with random variables x ~ /x and r ~ v. An exact trans¬ 
port map T : —>• M. d is a deterministic transformation that pushes forward /x to u, yielding 
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( 3 - 1 ) 


v(A) = v{T-\A)) 


for any Borel measurable set A C WL d . This pushforward relationship is denoted concisely 
by v = Xjj/r. In terms of the random variables, we may write r *= T(x), where =' denotes 
equality in distribution. 

Existence of a T satisfying (3.1) is guaranteed when fi has no atoms [ 8 , 44], but there 
can be infinitely many transport maps between two arbitrary probability measures. To 
regularize the problem, and for additional reasons described below, we restrict our attention 
to maps with the following lower triangular structure: 


(3.2) 


T(x l,X2, ...,x d ) 


Ti( Xl ) 

T 2 (x i,x 2 ) 

T d (xi,x 2 , ...,x d ) 


where subscripts denote components of a: € This lower triangular map is known as the 
Knothe-Rosenblatt (K-R) rearrangement. For absolutely continuous target and reference 
measures, the K-R map exists and is uniquely defined (up to ordering of the coordinates), 
and has a lower triangular Jacobian with positive diagonal entries (//-a.e .). 1 The map is 
thus a bijection between the ranges of x and r. This lower triangular structure will be 
particularly useful for our multiscale approach, as we explain in the next section. 

3.2. Exact multiscale inference. Let the reference random variable be composed of 
independent standard Gaussians, r ~ N(0,I), and let the target measure be the joint 
prior on i := ( 7 , 6). In this section, we suppose that we have a lower triangular map 
T : + d e that pushes forward the prior to the reference, i.e., 


(3.3) 


r c 

i.d. 

' r c ( 7 ) 

. r f . 


. T/M) _ 


n 1,0), 


where r c and 77 are standard Gaussian random variables with dimensions d 7 and dg, re¬ 
spectively. In Section 3.3 we will discuss how to construct such a map, but for now we 
proceed as if we have an exact transport map in hand. It will be convenient to define 
S : R'L+dfl —y R c L+'L as the lower triangular inverse of T( 7 , 9 ), such that 


(3.4) 


Sir) 


S c (r c ) 

i.d. 

7 

. S f (r c ,r f ) _ 
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We now wish to use the upper block of (3.4) to rewrite the prior and posterior on the 
coarse-scale quantity 7. 2 Effectively we will “transfer” the coarse-scale inference problem 


1 Application-specific orderings will be discussed in Section 5. More general comments on useful orderings 
of the coordinates can be found in [50]. 

2 Recall that we do not, in general, have a direct way of evaluating the prior density of the coarse-scale 
quantity 77 ( 7 ). As we will show in Section 3.3, implementing our method only requires the ability to 
generate joint prior samples of the fine-scale and coarse-scale parameters [9, 7 ). 
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to the reference variable r c via the bijection T c = S c 1 . The pullback of the prior marginal 
7 t 7 ( 7 ) through S c has the following density: 

vr 7 ( S c (r c )) | det VS' c (r c )| = p(r c ), 

where p denotes the standard normal density. Thus, knowing the transport map S c enables 
us to evaluate the coarse-scale prior density 7 r 7 . But we can go one step further, applying 
the same variable transformation to the coarse posterior 7 r 7 | ?/ (t| 2 /) 00 7T y\'y(y\'~f) 7r 'y('y) i n order 
to obtain a posterior on r c : 

(3.5) ir{r c \y) = vr 7 | y ( S c (r c )\y) | det V5 c (r c )| 

oc ( y\S c (r c )) 7 r 7 ( S c (r c )) | det V5 c (r c )| 

= TTj/i-y (y\S c (r c )) p(r c ) 

= Tr(y\r c )p(r c ). 

Parameterizing the coarse-scale inference problem in terms of r c is convenient as now the 
prior is simply standard normal. 

Next, we can use the maps (3.3) and (3.4) to generate samples from the fine-scale 
conditional density 7 r( 0 | 7 ). Since T c is a bijection, sampling from 7 r( 0 | 7 *) for a fixed value 
7 * is equivalent to sampling tt(6\ r*) when T c ( 7 *) = r*. With the help of Sf, we can simulate 
the random variable 9\r* whose density is 7r(0|r*) using the fact that 

(3.6) e\rl=' S f (r* c ,r f ). 

According to this expression, samples of 9\r* can be generated by first sampling the standard 
Gaussian 77 and then evaluating the map Sf. 

We combine these coarse- and fine-scale sampling strategies to define our complete 
multiscale framework. The conditional independence property (2.2) and the maps S and T 
allow us to sample the fine-scale posterior 9\y in two steps: 

1. Use MCMC or any other standard sampling strategy to sample the coarse posterior 
7r(r c \y) defined in (3.5). 

2. For each coarse posterior sample r*, generate one or more samples of rj from a 
standard normal distribution and evaluate Sf(r*,rj) to obtain fine-scale posterior 
samples. 

This procedure is detailed on lines 10-14 of Algorithm 1. Clearly, the maps T and S are 
critical to our approach. Next we will discuss how to construct these transformations. 

3.3. Constructing transport maps from samples. Our construction of the “forward” 
(reference to target) map S will depend on the “inverse” (target to reference) map T, so 
we will first focus on the construction of T. Consider the pullback of the standard normal 
reference measure through a candidate map T ; this pullback distribution has density 77 ( 7 , 9): 

(3.7) 7f( 7 , 0)=p (T( 7 , 9)) |det VT( 7 , 9) \, 

where VT is the Jacobian of T and det VT is the determinant of the Jacobian. To simplify 
notation, we again let x = (7 ,9), so that the target density is 77(7 ,8) = n(x). 
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We evaluate the difference between the target density n and map-induced density if 
using the Kullback-Leibler (KL) divergence: 


(3.8) 


-DKL(7r||7r) = E„ 
= E„ 


log 

log 


7 t(x) 
7 t(x) 


7 t(x) 


p(T(x)) |detVT(x)| 


= [log 7r(x) - log p(T(x)) - log |det VT(x)|], 

where E^ denotes the expectation with respect to the joint prior density 7r(x). Using (3.8), 
we will find transport maps by solving the minimization problem 


(3.9) 


minE-n- [— log p(T(x)) — log |det VT(x)|]. 


Note that log tt(x) was removed from the objective because it does not depend on T. T is the 
space of monotone, continuously differentiable, and lower triangular maps from M c b+ C b to 
M.d 7 +d fl . For a sufficiently smooth joint density 7r(x), T will contain the Knothe-Rosenblatt 
map and hence the solution to (3.9) will be an exact measure transformation, for which 
Dkl{k\\tt) = 0 . 

In most situations, the minimization problem (3.9) must be approximated in two re¬ 
spects: first, because the expectation cannot be computed exactly, and second, because 
the map might be represented in a space T C T that does not include the exact Knothe- 
Rosenblatt map. In these situations, we instead find an approximate lower triangular map 

T such that T(x) ~ r. Suppose we have K samples x^ from the joint prior 7r(x). We can 
use these samples to define a Monte Carlo approximation of the expectation in (3.9). As 
detailed in [52], one can define the approximate map T as the solution of the minimization 
problem 


(3.10) 


mm — 
Ter K 


i K 

-Y 


k =1 


dry -\-d,Q 


log p (V(x (fc) )) + ^2 lo S 


i— 1 


dT(x^) 

dxi 


d T-(xW) 

s.t. ——> A m i n > 0, k G {1,2,..., K}, i € {1, 2,..., + dg}, 


where A m i n is a small positive scalar introduced to ensure monotonicity and T is the space 
of maps spanned by a finite set of basis functions {V’l, ^ 2 , • • •, V’iv}- Note that the log- 
determinant term has been expanded into a sum and that the absolute value has been 
removed; this is a result of using a monotonically increasing lower triangular map. In this 
work, we use multivariate Hermite polynomials to represent the map. These polynomial 
basis functions will also be combined with problem-specific structure (discussed in Section 
5) to facilitate map construction in high dimensions. 

While [52] solved (3.11) directly (though in a different context, with samples generated 
via MCMC), the applications in this paper are of much higher dimension. We have found 
that additional constraints on T can help obtain accurate maps with fewer samples. Using 
the fact that the reference density p{r) is a standard normal, we constrain the output of T to 
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have unit sample variance and zero sample mean. With these constraints, the optimization 
problem for X) becomes 


(3.11) 


min 

TeTi 


s.t. 


1 

K 


E 


dry + d,Q 


E lo s 


dTj(x 

dxi 


dT(x( k h 

—A m i n > 0, k £ {1, 2,... , K}, i € {1, 2,... , d 7 + dg\ 

1 K ~ 

-Y J T l (x^) = 0 , *€{ 1 , 2 ,...,^ + ^} 

V k =1 
1 K ~ 

-Y J Tf{x^) = 1, *€{1,2,...,^ + ^}. 

fc=i 


Notice that we have removed the log p\T(x^)j term from the objective. Because the 

reference density p(-) is standard normal, Ylk=i l°gP (t(x^)^J oc — X)a-Li T 2 {x^)\ hence, 
satisfying the variance constraint ensures that this term is constant. The additional con¬ 
straints in (3.11) make it slightly more difficult to solve than (3.10), but in our experience, 
imposing the mean and variance information yields more accurate maps, warranting the ex¬ 
tra effort required during optimization. We should note that (3.11) is non-convex because 
of the quadratic constraint. However, we have not found this property to cause convergence 
issues in practice; future work might be able to explain this fact by extending the global 
minimum ideas from [46]. 

The structure of the constrained optimization problem (3.11) also enables several ef¬ 
ficient solution approaches. First, when the map parameterization is independent across 
each dimension (e.g., each component of the map is represented with its own expansion), 
the optimization problem in (3.11) is separable. Thus we can independently solve d 7 + dg 
smaller optimization problems instead of one large optimization problem. Next, we will 
ensure that each X) is linear in the coefficients of its basis representation (e.g., a polynomial 
expansion); as a result, the objective can be evaluated using efficient linear algebra routines. 

Let each component X) of the map be expressed in the form 

( 3 . 12 ) 7 )(. t ) = 'y ] ai^j(x), 

i&Ji 


where ctjj is the coefficient for the basis function V’jC*) is a multivariate Hermite 

polynomial whose degrees in each coordinate are specified by the multi-index j, and the 
multi-index set Jt defines the basis functions used for output dimension i of the map. Note 
that the triangular structure of the map can be encoded through the choices of J\. With this 
representation, optimization over T is equivalent to optimizing over the map coefficients. 
Using (3.12), we now define two Vandermonde matrices A{ and G{ containing evaluations of 
the basis functions and their derivatives, respectively. Let Ji = {ji,i, ji, 2 , ■ ■ ■ ,where 
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\Ji\ is the cardinality of J7j. Then these matrices take the form 





2 (* (1) ) • 


(3.13) 

Ai = 

'•Pji, i(z (2) ) 

^ji,2 (* (2) ) ' 

• Vy i ,| l7l |(z (2) ) 



1 

7T • 

5 




and 







(3.14) 

Gi = 



• ^(* (2) ) 







With these matrices in hand, the optimization problem from (3.11) can be written simply 

min — c T log (&V-h) 

a-i 

S.t. GiCXi T A m i n , 

C 1 AiOLi = 0 , 

aj Aj AiOCi = K, 


as 

(3.15) 


for each i € { 1 ,... , d 7 + dg}, where the log is taken componentwise, c is a length-if vec¬ 
tor of ones, and is a vector of the expansion coefficients. This problem can now be 
solved easily using any technique for constrained optimization. In particular, we use an 
augmented Lagrangian method [2, 39] to handle the constraints and a full Newton op¬ 
timizer with backtracking line search on the unconstrained subproblems. More advanced 
methods or implementations like IPOPT [ 66 ] or ADMM [7] might reduce the computational 
effort needed to solve the optimization problem in (3.15). In our experience, however, the 
main computational cost of map construction does not lie in solving (3.15), but rather in 
generating the K prior samples of x = ( 7 , 6) needed to define (3.13) and (3.14). 

3.4. From inverse map to forward map. Recall that the posterior sampling procedure 
in Section 3.2 requires evaluations of the maps S c and Sf in (3.4), which push forward the 
standard normal reference distribution to the joint prior distribution 7 t(.t). As proposed 
in [52], taking advantage of the lower triangular structure, it is possible to evaluate the 
inverse of T (that is, S) using a sequence of one-dimensional polynomial solves. However, 
when many evaluations of S are required, it is more efficient to approximate S directly and 
to evaluate this approximation without explicitly inverting T. We now define a regression 
procedure that constructs an approximation S' of 5, using the map T and the samples 
{x^}. 

Using each joint sample x^ k \ we can compute r ^ = T (x^^j to obtain sample pairs 
corresponding to the input and output of S. With these pairs, we can construct S with 
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standard least squares regression. The least squares objective is 


(3.16) 


min 

s 




k =1 



As with T, we represent the triangular map S using an expansion of multivariate Hermite 
polynomials. This allows us to find the coefficients of the map that minimizes (3.16) using 
standard linear least squares techniques (i.e., the QR decomposition of a Vandermonde 
matrix). The convergence properties of similar regression-based maps were studied by [57] 
in a discrete optimal transport setting. 

3.5. A complete algorithm for multiscale inference. The entire process of generat¬ 
ing fine-scale posterior samples using our multiscale inference framework is described in 
Algorithm 1. 


Algorithm 1: Overview of the entire multiscale inference framework. 

Input: A way to sample the fine-scale prior distribution 7 t(<9 ); a way to sample the upscaling 
distribution 7 r( 7 | 0 ); the number of coarse posterior samples A; and the number of fine 
samples per coarse sample M. 

Output: Samples of the fine scale posterior ir(9\y) 

/* Step 1: Generate prior samples */ 

1 for k t— 1 to K do 

2 Sample 8^ from n(8) 

3 Sample 7 ^ given 9^ 


/* Step 2: Compute the forward map T 

4 for it— 1 to do + d 7 do 

5 Solve (3.15) to get Ti 


*/ 


/* Step 3: Generate sample pairs and solve for the inverse map S 

6 for k t— 1 to K do 

7 L =T(7 (fc) ,0 (fc) ) 

8 Solve (3.16) to get S c and Sf 

/* Step 4: Sample the coarse posterior 

9 Generate | ri X \ri 2 \ ... ,ri N ^ > from n(r c \y) (3.5), using MCMC or another sampling method 


/* Step 5: Generate fine scale posterior samples 
10 for it— 1 to N do 

for j t— 1 to M do 

Sample from a standard Gaussian 


14 return 


Posterior samples I# 1 - 1 ), 9 ^,..., | 


*/ 


*/ 


*/ 


3.6. Choosing the number of fine scale samples. After sampling the coarse-scale pos¬ 
terior (line 9 of Algorithm 1) , we have a set of (possibly correlated) samples {rc^, ,..., r 

11 








from ir(r c \y). The next step is to “prolong” these samples back to the fine scale by sampling 
7r(0| Tc) for each i. If our goal is to minimize the computational effort required to estimate 
the posterior expectation of a 0-dependent quantity with a certain accuracy, it is useful to 
consider how the variance of the estimator depends on the number of fine scale samples M 
produced for each coarse scale sample. There is a tradeoff between reducing the coarse- 
scale contribution to the variance (by increasing N) and reducing the conditional fine-scale 
contribution to the variance (by increasing M). The optimal choice depends on the com¬ 
putational cost of generating each kind of sample, on the degree of correlation among the 
coarse-scale samples, and on the magnitudes of the variances on the coarse and fine scales. 

For simplicity, suppose that we are interested in the fine-scale posterior mean E[0|y]. 3 
Using M fine scale samples for each of the N coarse samples, a Monte Carlo estimator of 
E[%] is 

N M 

(3T7) 0(r c ,r f ) = E E > 

i =1 j =1 

where r c is a set of N correlated samples of it(r c \y) and rf is a set containing the NM fine 
scale samples. 

We wish to choose M in order to minimize the variance of the estimator 9 for a given 
computational effort. This variance can be expanded (using the law of total variance) as 


(3.18) Var rciI . f 

0(r c ,r f ) 

= Var rc 

E rf |0(r c ,r f )|r c | 

+ E rc 

Var rf | 

0(r c ,r f )|r c | 


Ci C 2 

N + NM ’ 

where C\ and C 2 are constants depending on the form of 7r(0| r c ) and ir(r c \y), and on the 
degree of autocorrelation in r c (e.g., how well the coarse MCMC chain mixes). More details 
on derivation of Ci and C 2 can be found in [50]. In Section 6, we will discuss the estimation 
of Ci and C 2 ■ 

The expression in (3.18) is quite intuitive: part of the estimator variance stems from 
limited coarse posterior sampling (the Ci term) and part of the variance is a result of 
limited coarse-to-fine sampling (the C 2 term). However, the coarse samples have a different 
computational cost than the fine samples. With this in mind, we now try to find the values 
of N and M that minimize Var rc . rf [0(r c , rf)] for a fixed computational cost. 

Let f t ot be the total sampling time, which for this discussion is fixed a priori. Let t c 
be the average time it takes to generate one coarse sample and let t{ be the average time 
required to generate a fine sample from vr(0|r c ) using 5/(r c , r/). Because t to t is fixed, t c and 
tf must satisfy the constraint 


(3.19) 


ttot = t c N + t { NM. 


Solving (3.19) for M, plugging the result into (3.18), and minimizing over N. we find that 
minimum variance under the constraint (3.19) occurs at the following optimal value of N: 


(3.20) 


N* = 


ttot (Cite — \JC\ C' 2 t c tf) 
Citl - C 2 t c tf 


3 The analysis in this section can be extended to the estimation of posterior expectations of more general 
functions h(9) of the fine scale parameters. 
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Figure 1 . Optimal M from (3.21) for varying sample costs t c and tf. C 1 = 1 and C 2 = 0.7 are fixed in 
this illustration. 


which corresponds to an optimal value of M: 

Cit c — C 2 tj _ 1 

(Cit c — VCiC 2 t c t { ) 

Notice that the optimal number of fine samples M* does not depend on the number of 
coarse samples N or the total time itot- As we will show in Section 6 , the values of C\ and 
C 2 can be estimated and (3.21) can be used as a guideline for choosing M. The qualitative 
behavior of (3.21) is also informative. Figure 1 illustrates M* for varying t c and tf, with 
fixed Ci and C 2 . As one would expect, when fine samples are less expensive than coarse 
samples (tf < t c ), it is usually advantageous to produce more than one fine sample per 
coarse sample. This advantage diminishes as tf —>• t c . 

4. A small proof-of-concept example. For illustration, we now consider a small multi¬ 
scale inference problem with only two fine scale parameters, 9\ and 0 2 . The fine scale prior 
is a bivariate Gaussian with zero mean and identity covariance. The hne-to-coarse model 
defines the coarse-scale quantity 7 as a modified harmonic mean of two a priori log-normal 
random variables associated with the fine scale: 

7 1 + exp(-#i) + exp(- 0 2 ) + 

where yj ~ N(— 0.3,1.5 x ICG 3 ). The data y € M are related nonlinearly to the coarse 
quantity through 

(4.2) y = atan(y) + ?y c , 

where i] c N( 0,10 2 ). We have thus defined ir(9i,6 2 ), 7 r( 7 \9i,9 2 ), and 7 r(y| 7 ). 

We solve this problem using Algorithm 1, producing samples from the posterior n(9i, 9 2 \y). 
The maps T and S are represented with total-degree truncated multivariate Hermite poly¬ 
nomials. Figure 2 and Table 1 compare the true posterior density with the posterior density 
obtained using third, fifth, and seventh degree maps. Complete implementation details and 
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(3.21) 


M* = 


Cf 



















(a) Third degree 


(b) Fifth degree 


(c) Seventh degree (d) True posterior 


Figure 2. Convergence of the multiscale posterior to the true posterior as the total polynomial degree is 
increased. In all cases, 150 000 samples ofn(-y,6) were used to build the maps. 

Table 1 

Convergence of posterior from multiscale method on small test problem. The KL divergence between 
the true posterior n and the multiscale posterior n was computed using quadrature with the exact posterior 
density and a kernel density estimate of the multiscale posterior. The values reported here are the average 
KL divergences obtained from 30 runs of the multiscale method. 


Map Order 

13 5 7 

Error, Dkl [tt 7t] 

2.366e-01 5.549e-2 3.507e-2 3.407e-2 


code for this example can be found at http: //muq.mit. edu/examples. As the map degree 
is increased, the posterior density converges to the truth. This convergence is also seen in 
Table 1. Note that some of the small wiggles in the plots are artifacts of the kernel density 
estimator used to visualize the multiscale posterior densities. 

5. Application in simple groundwater flow. To illustrate the accuracy and performance 
of our multiscale approach, we now consider an inverse problem from subsurface hydrology. 
Our goal is to characterize subsurface structure by inferring a spatially distributed conduc¬ 
tivity field using limited observations of hydraulic head. An elliptic equation, commonly 
called the pressure equation, will serve as a simple steady-state model of groundwater flow 
in a confined aquifer: 

(5.1) -V ■ (k(s)V/i(*)) =/(*), 

where x E C M 0 * is a spatial coordinate in d = 1 or d = 2 dimensions, k(x) is the hydraulic 
conductivity field we wish to characterize, f{x) contains well or recharge terms, and h(x) 
is the hydraulic head that we can measure at several locations throughout the domain. See 
[5] for a derivation of this model and a comprehensive discussion of flow in porous media. 

The elliptic model in (5.1) acts as a nonlinear lowpass filter, removing high frequency 
features of k(x) from h(x). This means that some features of k(x) cannot be estimated 
even when h(x) is observed with high precision. Variational discretization methods such 
as multiscale finite element methods (MsFEM) [1, 29], multiscale finite volume methods 
[36], variational multiscale methods [31, 37], heterogeneous multiscale methods [17], and 
subgrid upscaling [3] take advantage of this smoothing to create a smaller, easy-to-solve, 
coarse scale problem. The common idea behind all of these strategies is to (implicitly or 
explicitly) coarsen the elliptic operator in a way that allows for more efficient, yet accurate, 
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solutions. In the examples below, MsFEM will be used to solve (5.1) and to define the 
coarse parameter 7 . 

5.1. Multiscale finite element method. Here, we present a very brief description of 
MsFEM, with only enough detail to understand its use in the multiscale inference setting. 
See [29] for a comprehensive discussion of MsFEM theory and implementation. 

Let H be the spatial domain on which we want to solve (5.1), and let V be a mesh 
discretization of fh Then, for a suitable function space X, the weak formulation of (5.1) 
seeks u G X such that 

/ k(x)Xu ■ Xvdx = / fvdx , Vu € X. 

Jn J o 


Projecting u and v onto a finite number of nodal basis functions </> 2 ,..., 4>b} associated 
with V yields a finite-dimensional linear system Au = b, where it is a vector of coefficients 
of the basis functions and 


(5.2) 


A 


v 



K{x)X<fii ■ Xcj)jdx = 


) / k(x)V fa ■ V(f)jdx. 

cev Jc 


Here C is an element in the discretization V of H. We refer to the quantity 

eij,c = / k(x)X4>i ■ V<f>i dx 

Jc 

as an elemental integral. Note that the elemental integrals are assembled to construct A. and 
thus are sufficient to characterize the coefficients u —and hence the values of the pressure 
head h(x) at the coarse element nodes. This property will be important for the use of 
MsFEM in the context of inference. 

For the pressure equation in (5.1), the key difference between an MsFEM formulation 
and a standard continuous Galerkin formulation lies in the choice of basis functions {&}■ In 
the standard Galerkin setting, one might use hat functions or another standard basis. In the 
MsFEM context, however, the basis functions are chosen to satisfy a homogeneous version 
of (5.1) over each element C. These basis functions depend on—and thus encode—local 
fine-scale spatial variation in the coefficient k(x). The advantage of this choice is that a 
good approximation of h(x) can be achieved with a coarse discretization V and subsequently 
a smaller linear system. See [29], [35], or [36] for important implementation details, such as 
the choice of boundary conditions when solving for the MsFEM basis functions. 

5.2. Multiscale framework applied to MsFEM. Recall that our goal is to accelerate 
inference of the conductivity field k(x), which may contain fine-scale spatial features that 
are represented with a high-dimensional discretization. The MsFEM approach implies that 
elemental integrals are sufficient to describe the head /i(x ), 4 which is equivalent to the 
conditional independence assumption in (2.2). The elemental integrals can therefore be 
used to define the coarse parameters 7 in the multiscale inference framework. The exact 

4 Given the elemental integrals { e.ij,c }, we can solve directly for h{x ) at the nodes of the coarse dis¬ 
cretization V. We will thus construct the coarse mesh so that observations occur only at the coarse nodes. 
Representing h{x ) at other points in space requires explicit access to the MsFEM basis functions 
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relationship between the elemental integrals e^c 1 and the coarse parameters 7 will depend 
on the spatial dimension (one or two in our case) and will be discussed below. In all cases, 
however, the fine scale parameter will be defined as 8 = log n. Here and below, when we 
omit the argument x and write k rather than k(x), we refer to a discretized version of the 
conductivity, k € R ds . 

The large dimension of 8 makes this problem interesting, but at the same time makes 
construction of the transport map difficult. For example, a map of total degree three in 
110 dimensions will have 234136 polynomial coefficients! Such a general form for the map 
is infeasible, and a more judicious choice of map parameterization is required. 

5.2.1. Strategies for building maps in one spatial dimension. In one spatial dimension, 
MsFEM produces one independent elemental integral per coarse element. Thus the coarse 
parameter dimension d 7 is equal to the number of coarse elements. The log-conductivity 8 
has finer scale features, so its dimension dg is typically much higher. Let n be the number 
of fine elements in each coarse element, so that dg = nd 7 . In the numerical example of 
Section 6.1, we will use d 7 = 10 and dg = 100, son = 10. 

Now consider the impact of these dimensions on the map representation. A polynomial 
map of total degree three in 0(10) dimensions is straightforward to handle, for example, 
and thus we do not need to employ any special truncations or structure in defining the 
coarse-scale maps T c and S c ] polynomial maps of moderate degree will suffice in practice. 
The coarse-to-fine map Tf, however, would be a function from R 110 to R 100 in the example 
mentioned above. A generic total-order polynomial representation for such a function is 
not tractable. Instead, we will take advantage of spatial locality to construct a much more 
efficient parameterization of Tf. 

First, let us endow 6 with a Gaussian prior. (This does not sacrifice generality, as other 
prior distributions can be written as deterministic transformations of this Gaussian; indeed, 
we are actually using a log-normal prior on the conductivity k(x), since 8 = logK.) Now 
each of 8, r c , and 77 are multivariate Gaussians. While this does not imply that 8 , r c , 
and 77 are jointly Gaussian, it does suggest that a linear map may characterize much of 
the joint structure between these random variables, allowing localized nonlinearities in the 
map to characterize non-Gaussian features of the joint distribution. Consider the vector of 
reference random variables 

r= [r c ,r/] T = [771, r C)2 ,..., r cA , r fA , r /)2 ,..., r fAe \ T . 

We begin with a linear representation of the map Tf and then enrich the collection of linear 
terms with selected nonlinear terms. The initial set of linear multi-indices for output i of 
T f is 

(5.3) J} = {j : j € Ng 7+i , ||j||i <l} for i = 1... dg. 

Now, to introduce some nonlinear structure, we will take advantage of spatial locality in the 
definition of the coarse quantities 7 according to MsFEM. Every component of r is spatially 
related to one element in the coarse mesh and thus one component of 7. Specifically, for 
our structured mesh, component i of the fine scale field 8 is related to component p(i) of 
the coarse parameter, where 

(5.4) p{i) = [i/n\ + 1. 
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Thus, to introduce local nonlinear terms for the z th output of Sf, we will include nonlinear 
terms in the inputs and r/ 7 . Similarly, the z th output of Tf will be nonlinear in 

and 0{. Combining these terms with the linear multi-indices yields a multi-index set tuned 
to the coarse/fine quantities in our one-dimensional application of MsFEM: 

(5.5) Ji = j} U |j : j € N[j 7+ \ ||j||i < P, j k = 0 for k g {p(i),i + d 7 }j . 

where P is the maximum polynomial degree of the nonlinear terms. To help assure mono¬ 
tonicity of the map, P should be odd. When constructing maps from a finite number of 
samples K, P should also be chosen no larger than needed to adequately Gaussianize the 
target samples; this value is problem-dependent, of course, but in practice is often as low 
as P = 3 . 5 This localized multi-index for Tj will result in nonlinear terms and interactions 
between jpU) and 9i, and linear terms for all components of 7 and components 9 k with 
k < i. The localized set J % will contain i + d 7 + (P + 1 )(P + 2)/2 terms, whereas a standard 
total degree multi-index set would require terms. 

A further simplification occurs if we temporarily restrict our attention to P = 1 in the 
multi-index set above. Now the fine-scale maps Sf and Tf are completely linear. (The 
smaller coarse-scale maps S c and T c can remain nonlinear.) While the optimization and 
regression approach from Section 3 can still be used, directly appealing to cross-covariance 
matrices can enable more efficient construction of Sf when the prior tt(9) is Gaussian, 
9 ~ N(/j,q, £ 00 ). Recall that when constructing S c using regression, samples of 7 are pushed 
through T c to obtain corresponding samples of r c : = T c ( 7 ®). Furthermore, because 

prior-distributed samples 7 ® are generated by sampling the conditional 7 W ~ 7 r ( 7 | 0 «), 
each reference sample is matched with a sample 9 ® from the prior 7 r(0). We know 
that r c and 9 are individually Gaussian. For this special linear-map case, we can make the 
further assumption that they are jointly Gaussian, or at least can be well-approximated as 
such. Under this temporary assumption, an empirical estimate of the cross-covariance of r c 
and 9 , denoted by £ rc 0 , can be used to define Sf. Then the conditional distribution of 9 
given r c is simply 

(5.6) 7 r( 0 |r c ) = N (ji 6 + T,J c gr c , £00 - £^£^ 0 ) , 
which implies that Sf is given by 

(5-7) 9 = S f (r c , 77) = pe + ^J c e r c + (£ ee ~ £^£^ 0 ) 1 r f . 

The map in (5.7) is similar to an ensemble Kalman filter (EnKF) update [21]. In fact, the 
Kalman gain would be given by £ rc 0 and the observation matrix by H = . However, 

our approach estimates H from samples, while the EnKF estimates £ 00 . 

In our numerical experiments, we have observed that this method of constructing a linear 
Sf is much more efficient than the more general optimization and regression approach from 
Section 3. Just as importantly, in our applications, this linear Sf map seems to give the same 

J For more information on assessing the quality of transports, e.g., via tests of Gaussianity, see [42]. 
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posterior accuracy as a linear Sf produced with optimization and regression. The accuracy 
and efficiency tables in Section 6 will illustrate the performance of this cross-covariance 
approach for constructing the map, and compare it to the nonlinear P > 1 case. 

5.2.2. Strategies for building maps in two spatial dimensions. In two spatial dimen¬ 
sions, we use quadrilateral coarse elements, resulting in ten elemental integrals per coarse 
element. Among these ten quantities, however, there are only six degrees of freedom. The 
six degrees of freedom are coefficients of an orthogonal basis for the elemental integrals . 6 As 
the number of coarse elements increases, the number of coarse quantities quickly becomes 
too large to tackle with a simple total-degree map. We will again use the problem struc¬ 
ture to find a more tractable representation for T c and S c . In particular, we will restrict 
our attention to problems with stationary prior distributions and use this stationarity in 
combination with the locality of MsFEM. For convenience of notation, let V = d 7 /6 denote 
the number of coarse elements in our 2D discretization. 

As in the one-dimensional formulation above, we will again combine r c and rj into a 
single vector, but now the expression 

r = Vc r f ] T = [r c> i, r C)2 , • • •, r c y, r fi i, ry j2 , • ■ ■, r J4e \ T , 

represents the coarse reference random variables in blocks. That is, r C i E R 6 contains the 
six components of r c corresponding to coarse element i. On the other hand, each is a 
scalar. Similar to r c , 7 = [ 71 , y 2 ,... , 7 y], with each 7 * E R 6 , represents a block definition 
of 7 . Each of the six scalar components of 7 \ (i.e., 7 ^, E R, k = 1... 6 ) represents a 
particular coefficient of the orthogonal basis for the elemental integrals in coarse element 
i. With a stationary prior on 6, we will have a stationary prior on 7 . Stationarity implies 
that the marginal distribution of any 7 j & will be the same for all i and, moreover, that the 
six-dimensional distribution of the coefficients is the same across elements: 

(5-8) 7*=' 7j, 


for i. j E {1,2,... , V}. We will exploit this structure to build S c . 

First, consider a map S'” 1 that pushes a 6-dinrensional standard Gaussian to the prior 
marginal distribution vr( 7 j). This map is 6-dinrensional regardless of how many coarse ele¬ 
ments are used, and it captures the nonlinear dependence among the six coarse degrees of 
freedom. Now, assume that we have constructed S™ and its inverse T™ using the optimiza¬ 
tion and regression approach from Section 3. Using T™, we can define the random variable 
r™ E R^ 7 as 


(5.9) 


T 1 ”( 7 i),T c ” 1 ( 72 ),...,r c ” 1 ( 7 V') 


i T 


Notice that each block of r™ is marginally a standard Gaussian with iid components, but 
the entire variable r™ is not; correlations between coarse elements remain in r™ (due to 


6 We find this basis by taking an SVD of the matrix containing prior samples of the elemental integrals. 

This matrix is rank-deficient. The reason that there are only six degrees of freedom can be understood by 
considering various symmetries of the problem. 
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correlations in the fine scale field). To remove these correlations, we use a lower triangular 
Cholesky decomposition of the covariance of r™ given by 

(5.10) Cov[r c m ] = ii T . 


The lower triangular Cholesky factor L can itself be divided into blocks corresponding to 
each of the coarse elements 


(5.11) 

L = 

L\,i 

L>2,1 

0 

^2,2 

1- 

OO 

O O 




L(V-1),2 

• • • 0 



Ly,i 

Ly, 2 

W,(v-i) L v,v 


where each diagonal entry is a 6 x 6 lower triangular matrix. Notice that applying L 1 to 
r™ will remove linear correlations from r™, leading to 

(5.12) r c = L-V™ =* Lr c = r™. 

Combining L with the local nonlinear map S™, the entire coarse map S c is defined as 


(5.13) 


' S?(Li,ir Ctl ) 

S™ (T 2 ,ir C) i + L2,21"0,2) 

- S™ {Lv,ir c ,i + Ly, 2 T c ,2 + ... + Lyyr c y) . 


Importantly, constructing this map only requires building a nonlinear map in six dimensions, 
which can be accomplished easily with total degree polynomial expansions. 

In the two-dimensional example below, Sf is constructed using the cross covariance 
approach described in Section 5.2.1 for the one-dimensional problem. The samples of r c 
used in the sample covariance £ rc 0 are computed using the nonlinear inverse map T" 1 
composed with L -1 . 


6. Numerical results. 

6.1. One spatial dimension. Here we apply our multiscale framework and the previous 
section’s problem-specific map structure to our first “large-scale” inference problem. The 
goal of this section is to analyze the efficiency and accuracy of our multiscale inference 
strategy by comparing it with a standard MCMC approach. The inverse problem involves 
inferring the spatially distributed conductivity field in (5.1) from noisy observations of the 
hydraulic head, using MsFEM as the forward model. The spatial domain is one-dimensional, 
il = [0,1]. While our multiscale approach can handle much larger problems (see Section 6.2), 
the problem size is restricted in this example in order to enable comparison with full¬ 
dimensional MCMC. 

We wish to sample the posterior distribution n(9\y), where 6 = logK is the discretized 
log conductivity field and the data y are a set of pointwise head observations. We use a 
Gaussian prior on 9 with zero mean and exponential covariance kernel 

Cov (6(xi),9(x 2)) = &g exp 


\xi - x 2 


( 6 . 1 ) 
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We set the correlation length to L = 0.1 and the prior variance to a | = 1.0. An exponential 
kernel was chosen for two reasons: (i) this class of covariance kernel yields rough fields that 
are often found in practice, but difficult to handle with dimension reduction techniques such 
as the Karhunen-Loeve decomposition; and (ii) MsFEM is more accurate for problems with 
stronger scale separation, which is the case for rougher coefficient fields. We use 10 coarse 
elements and n = 10 fine elements per coarse element. Thus 9 is a 100-dimensional random 
variable. The data y are 9-dimensional, coming from observations at the interior nodes of 
the coarse mesh. Zero Dirichlet boundary conditions are imposed: h( 0) = h( 1) = 0. To 
generate the data, a realization of the prior log-conductivity (shown in Figure 4) is combined 
with a very fine-mesh FEM forward solver to produce a representative head field. The head 
field is then down-sampled and combined with additive iid Gaussian noise to obtain the 
data. The noise has zero mean and variance ICG 4 . 

Benchmark results are obtained by running MCMC on the full 100 dimensional rep¬ 
resentation of 9 , with MsFEM as the forward model. We use two variants of MCMC in 
our tests: the delayed rejection adaptive Metropolis (DRAM) MCMC algorithm [25] and 
a preconditioned Metropolis-adjusted Langevin algorithm (preMALA) [55]. The DRAM 
algorithm is tuned to have an acceptance rate of 35%. Two stages are used for the DR 
part of DRAM, but the second stage was turned off after 7 x 10 4 MCMC steps. We set 
the covariance of the preMALA proposal to the inverse of the Gauss-Newton Hessian at the 
posterior MAP point. The preMALA algorithm also uses gradient information to shift the 
proposal towards higher density regions of the posterior. For the single-scale posterior here, 
finite differences were used to compute the Hessian, which may have hindered preMALA 
performance in Table 3. For both preMALA and DRAM, we run long chains of 5 x 10 6 
samples, discarding the first 10 5 steps as burn-in after starting the chain from the MAP 
point. Because the DRAM chain seems to mix better than the preMALA chain, we use the 
former for the accuracy comparison in Table 2. 

In the multiscale approach of Algorithm 1, we must sample the coarse posterior ir(r c \y). 
We do so using preMALA with the Hessian at MAP as a preconditioner; for this target 
distribution, preMALA was found to be more efficient than DRAM. preMALA was tuned 
to have an acceptance rate of around 55%. 

When the multiscale definition in (2.2) is completely satisfied (as it is in this case) and 
exact transport maps are used, posterior samples produced by our multiscale framework will 
be samples from n(9\y). As described in the preceding sections, however, various approxi¬ 
mations are required to efficiently compute the transport maps. Hence in this application, 
our multiscale method produces only approximate samples of n(9\y). Table 2 and Figure 
3 show that this approximation is quite good. Table 2 reports quantiles of the marginal 
posterior n(0i\y) for 9i at particular spatial locations. The “exact” quantiles are taken 
to be those produced by the long full-dimensional DRAM run. These are compared with 
quantiles computed using ^-samples from our multiscale framework. The multiscale 95% 
intervals in Figure 3 are computed from the mean and variance of 50 independent runs of 
the multiscale method. Each run used 5 x 10 4 prior samples to construct the maps, and 
then generated 10° posterior samples of 9, which were used to estimate the quantiles. Im¬ 
portantly, these posterior quantiles provide a diagnostic that is sensitive to non-Gaussian 
posterior structure. 

As shown in Figure 3, there is a negative bias in the results near x = 0.3. This is 
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likely caused by the approximation of Sf for parameters near that point. A coarse element 
boundary exists at x = 0.3 and there is large dip in the true value of 9 over this boundary. 
Such large dips do not occur in high-density regions of the prior, and restricting Sf to have 
only local nonlinearities might prevent the map from adequately capturing the tail behavior 
necessary to exactly characterize this posterior. With a more expressive coarse-to-fine map 
Sf, this bias would decrease. In all other locations, however, the true MCMC posterior and 
the multiscale posterior are in excellent agreement. 

Table 2 

Estimated bias in posterior quantile estimates for the multiscale inference framework, at different spatial 
locations x. E a is the average error (i.e., the bias) between the full-dimensional MCMC estimate and 
multiscale estimate of the a/100 quantile. Focusing on the median (a = 50), biases do not seem to change 
significantly as the degree of Sf changes. Values of E 50 are more sensitive to the degree of S c . 


S c degree 

Sf degree 

X 

Eq 5 

E25 

E50 

F75 

F95 



0.1 

l.lle-02 

4.72e-02 

5.92e-02 

5.67e-02 

3.35e-02 


1 

0.3 

-3.58e-01 

-3.05e-01 

-2.83e-01 

-2.75e-01 

-2.83e-01 


0.5 

-1.28e-01 

-1.40e-01 

-1.62e-01 

-1.95e-01 

-2.60e-01 

1 


0.9 

1.86e-02 

6.46e-02 

7.85e-02 

7.92e-02 

5.02e-02 


0.1 

2.16e-01 

1.26e-01 

5.42e-02 

-2.58e-02 

-1.53e-01 



0.3 

-1.49e-01 

-2.21e-01 

-2.80e-01 

-3.44e-01 

-4.42e-01 


O 

0.5 

7.50e-02 

-5.98e-02 

-1.61e-01 

-2.67e-01 

-4.23e-01 



0.9 

2.45e-01 

1.54e-01 

7.91e-02 

-4.53e-03 

-1.35e-01 



0.1 

-9.24e-04 

3.60e-02 

4.77e-02 

4.76e-02 

2.28e-02 



0.3 

-3.60e-01 

-3.05e-01 

-2.84e-01 

-2.73e-01 

-2.79e-01 



0.5 

-1.00e-01 

-1.12e-01 

-1.34e-01 

-1.69e-01 

-2.34e-01 

3 


0.9 

2.17e-03 

4.84e-02 

6.26e-02 

6.31e-02 

3.53e-02 



0.1 

2.15e-01 

1.22e-01 

4.17e-02 

-4.46e-02 

-1.90e-01 



0.3 

-4.11e-02 

-9.15e-02 

-1.32e-01 

-1.78e-01 

-2.54e-01 


O 

0.5 

1.06e-01 

-3.52e-02 

-1.39e-01 

-2.49e-01 

-4.15e-01 



0.9 

2.42e-01 

1.30e-01 

4.74e-02 

-3.97e-02 

-1.79e-01 


Marginal posterior diagnostics such as quantiles only tell part of the story. Another 
important feature is the correlation structure of the posterior realizations. As shown in 
Figure 4, our multiscale approach correctly produces posterior samples with the same fine- 
scale roughness as the prior. 

Now consider the computational efficiency of the multiscale method. The effective 
sample size (ESS) is one measure of the information contained in a set of posterior samples. 
The ESS represents the number of effectively independent samples contained in the set. In 
an MCMC context, we can easily compute this quantity for a chain at equilibrium [69]. 
Here, however, we will use a more direct definition of ESS using the variance of a Monte 
Carlo estimator. Suppose that we have a Monte Carlo estimator 6i of the mean of 6. L . The 
ESS for such an estimator is given by the ratio of the variances of the target random variable 
and the estimator: 


( 6 . 2 ) 


ESS; 


Var(flj) 

Var(£) 


Here ESS* denotes the effective sample size for dimension i of 6\ ESS can differ for each 
dimension, and we will typically report the minimum and maximum ESS* for i = 1... dg. 
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(a) 95% region of multiscale quantile estimator (shaded blue) using cross-covariance map, compared 
to “benchmark” MCMC quantile. 




(b) 95% region of multiscale quantile estimator (shaded red) using local polynomial map, compared 
to “benchmark” MCMC quantile. 

Figure 3. Comparison of multiscale estimates of posterior 5%, 50%, and 95% quantiles with a fine-scale 
MCMC approach. The fine-scale MCMC chain was run for 4.9 million steps and the resulting quantiles are 
taken as the “true” quantiles in our analysis. Note that the vertical grid lines correspond to coarse element 
boundaries. A quantitative summary of these plots is given in Table 2. 


This expression for ESS is more costly to compute than methods based on MCMC auto¬ 
correlation [69] because evaluating Var (0j) requires many independent realizations of the 
Monte Carlo estimator (i.e., running the inference procedure many times). But this ap¬ 
proach is less susceptible to errors stemming from autocorrelation integration and does not 
require us to use ordered samples like those from an MCMC scheme. 

Table 3 shows the efficiency of our multiscale approach. Comparing full-scale DRAM 
MCMC with the multiscale results, we see that even when a nonlinear coarse-to-fine map 
Sf is used, proper tuning of the method can speed up the number of effectively independent 
samples generated per second by a factor of 2 to 5, with one fine sample generated per 
coarse sample (M = 1). When a linear Sf is employed, we can see speedups of 4.5 to 9 
times, using M = 5. These results indicate that as long as minor approximations to the 
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Figure 4. Comparison of posterior realizations with true log conductivity. The posterior samples main¬ 
tain the same fine-scale correlation structure as the true log conductivity. As in Figure 3, the vertical grid 
lines correspond to coarse element boundaries. 


posterior are acceptable, there is a clear advantage to using our multiscale approach. 

Table 3 

Comparison of posterior sampling efficiency between full-dimensional MCMC and variants of our mul¬ 
tiscale framework. The key column is ESS/t on (effectively independent samples generated per second), where 
higher numbers indicate better efficiency. The time t on measures the computation that must be performed 
after a particular value of d is observed; it does not include map-construction time. 


Method 

N 

M 

ton (SGC) 

ESS 

Min Max 

ESS/t orl 
Min Max 

MCMC-DRAM 

4900000 

NA 

2252.63 

6340 

11379 

2.8 

5.1 

MCMC-PreMALA 

4900000 

NA 

2773.47 

274 

729 

0.1 

0.3 

Cross Covariance 

500000 

1 

287.31 

2987 

20244 

10.4 

70.5 

Cross Covariance 

500000 

5 

314.17 

3971 

14597 

12.6 

46.5 

Local Cubic 

450000 

1 

937.41 

5408 

23679 

5.8 

25.3 

Local Cubic 

450000 

5 

3555.05 

5294 

18759 

1.5 

5.3 


Using the timing and ESS data from Table 3 for M = 1 and M = 5, we can also compute 
the optimal number of fine samples to generate for each coarse sample. To deploy the 
optimal expression in (3.21), we first use a least squares approach to compute the unknown 
coefficients C\ and C 2 - For the linear case, we obtain C\ = 22.7867 and C*2 = 10.2019, 
which yield an optimal value of M = 4. For the local cubic case, we obtain C\ = 11.6076 
and C*2 = 3.3135, which yield an optimal value of M = 1. It is worth generating additional 
fine-scale samples for the inexpensive linear map, but for the slightly more expensive cubic 
map, the time to generate more fine-scale samples is not worthwhile. This time would be 
better spent generating coarse samples. These values for M are dependent on the cost of 
each coarse model evaluation. In this one-dimensional problem, the coarse model is very 
cheap to evaluate—on par with the cost of evaluating the cubic map. However, for problems 
with more expensive model evaluations or with poorer coarse MCMC mixing, this will not 
be the case and larger values of M will be optimal. 
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6.2. Two spatial dimensions. The relatively small dimension of 9 in the one-dimensional 
problem above allowed us to compare our multiscale approach with very long “benchmark” 
MCMC runs. However, we expect our multiscale inference approach to yield even larger 
performance increases on large-scale problems where direct use of MCMC may not be fea¬ 
sible at all. Here we will again infer a log conductivity field in the elliptic equation (5.1); 
however, this example will have two spatial dimensions. The 2D grid is defined by an 8 x 8 
mesh of coarse elements over [0,1] x [0,1], with 13 x 13 fine elements in each coarse element. 
The log-conductivity is piecewise constant on each fine element, resulting in a 10816 dimen¬ 
sional inference problem. The zero mean Gaussian prior is again defined by an exponential 
kernel with correlation length 0.1. In two dimensions, this kernel takes the form 

(6.3) Cov (9{x\), 0(x2)) = cr| exp 

where || • ||2 is the usual Euclidean norm, <jg = 1.0, and L = 0.1. Notice that this kernel is 
isotropic but not separable. 

Synthetic data are generated by a full fine-scale simulation using a standard Galerkin 
FEM with additive iid Gaussian noise added to observations at each of the coarse nodes. 
The noise variance is 10~ 6 . Homogeneous (i.e., no flow) boundaries are used for x = 0 and 
x = 1. Dirichlet conditions are used on the top and bottom of the domain. These conditions 
are given by 


xi - X 2 \\2 
L 


h(x, y = 0) = x 
h(x, y = 1) = 1 — x. 


Since we cannot reasonably apply any other sampling method to this large problem 
directly, our confidence in the accuracy of the posterior can be judged from the accuracy 
of the transport maps S c and Sf. From the one-dimensional results, we know that linear 
Sf derived from the cross covariance of r c and 9 performs quite well; it is reasonable to 
assume that the same is true in the 2D case. A qualitative validation of the coarse map S c 
is given in Figure 5. The figure shows the true prior density of the coarse parameters over 
one coarse element, as well as the density induced by the coarse map, S c in (5.13). We see 
that the coarse map represents the coarse prior quite well. In terms of computational effort, 
the map was constructed using 85000 prior samples and a multivariate Hermite polynomial 
representation of S™ of total degree seven. Using the MIT Uncertainty Quantification 
(MUQ) library [51], and taking advantage of 16 compute nodes, each employing four threads 
on a cluster with 3.6 GHz Intel Xeon E5-1620 processors, prior sampling, S c construction, 
and Sf construction took less than one hour for this problem. 

With confidence in the transport maps, we can move on to posterior sampling. The 
preMALA MCMC algorithm, using the Hessian at the MAP as a preconditioner, was again 
used to sample the coarse posterior. It is relatively simple to compute gradients of the coarse 
posterior using adjoint methods, allowing us to use the Langevin approach effectively. Even 
though the coarse sampling problem still has d 7 = 384 parameters, the coarse map S c itself 
captures much of the problem structure and the coarse MCMC chain mixes remarkably well, 
achieving a near-optimal acceptance rate of 60%. Each coarse MCMC chain was run for 
2 x 10 5 steps. Ten independent parallel chains were run, and coarse sampling was completed 
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Figure 5. Comparison of the true coarse prior density and the coarse prior density induced by the map 
S c . A degree-7 Hermite polynomial expansion was used to parameterize Sf. The first coarse parameter on 
each coarse cell, corresponding to 71 , is the most difficult for the map to capture because of its log-normal 
shape. The color scales, contour levels, and axis bounds are the same for both plots. 


in 49 minutes. After coarse-scale MCMC sampling, the coarse samples were combined with 
independent samples of rj through Sf to generate posterior samples of the fine-scale variable 
6. This coarse-to-fine sampling took 61 minutes. Figure 6 shows the posterior mean and 
variance as well as two posterior samples. A single fine-scale sample was generated for 
each coarse sample. Notice that the fine-scale realizations have the same rough fine-scale 
structure as the true log(K) field. This is an important feature that would not be present in 
many methods based on a priori dimension reduction, such as truncated KL expansions. 

7. Discussion. We have developed a method for efficiently solving Bayesian inference 
problems containing multiscale structure, as defined in (2.2). The method uses trans¬ 
port maps to decouple the original inference problem into a well-conditioned and lower¬ 
dimensional coarse-scale sampling problem (sampling 7r(r c |y)), followed by direct coarse- 
to-fine “prolongation” (evaluating Sf with posterior r c samples). By exploiting locality 
and stationarity, we are able to build these transport maps despite the large dimension of 
the spatially distributed parameters of interest. Our method accurately approximates the 
true posterior and can be applied to very large problems that are essentially intractable 
with other sampling methods. We also stress that our approach is not restricted to sub¬ 
surface flow applications or elliptic PDE forward models. The construction of transport 
maps relies entirely on prior samples, and therefore is not problem-specific or tied to spe¬ 
cific probability distributions. We should also point out that nothing in our formulation 
changes when the prior model is hierarchical, e.g., 7r(#|C) with some hyperparameters £. If 
the posterior on the hyperparameters is not of particular interest, then marginal samples 
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of 7 t(9) = f 7r(0\()7r(()d( can directly be generated and our approach used as described.' 
Indeed, exact or approximate satisfaction of the conditional independence assumption (2.2) 
is all that is required to apply our framework. Inverse problems with this structure exist 
in many areas, ranging from tree physiology [24] to materials modeling [45], and numerous 
other problems with scale separation and/or smoothing forward models. 

A typical serial MCMC sampler could take weeks to run on a problem as large as the 
two-dimensional example from Section 6.2. Decoupling the problem using transport maps 
allowed us to solve it in only two hours. Part of this improvement lies in the parallelism 
intrinsic to our approach. All of the prior sampling, much of the optimization used to 
build the transport maps, and all of the post-MCMC coarse-to-fine map evaluations can 
be parallelized. This level of parallelism is not available in MCMC samplers, even when 
multiple chains are run, as MCMC is inherently a serial process. While we used some 
algorithm-level parallelism (employing MPI for parallel model evaluations and parallel map 
construction), our approach would lend itself well to more sophisticated distributed-memory 
or GPU-parallel implementations in the future. Such an implementation could further 
reduce the run time of our framework. 

Of course, allowing our approach to generate approximate posterior samples enables 
computational savings as well. As demonstrated in Section 5, the accuracy of the posterior 
approximation can be controlled by the representations of S c and Sf. In Section 6.2, we 
let Sf be linear in order to mitigate the required computational effort. In applications 
where more exact posterior sampling is required, however, a higher polynomial degree or 
alternative functional representation could be employed. Additionally, if problem-specific 
information is available (such as the locality and stationarity used in Section 5.2.2), it can 
also be incorporated into the map representation to further increase accuracy and reduce 
computational expense. This flexibility is important in practice, and should allow our 
multiscale approach to be applied in diverse areas. Complementing the use of problem- 
specific information is the development of more advanced map construction techniques, 
which could adaptively construct and refine maps within a user-specified form, targeting 
a specified error in (3.8). This is an important area of future research. We emphasize 
that the fundamental idea underlying our approach—interpreting multiscale structure as 
conditional independence, and applying it in a Bayesian setting—is independent of the 
algorithmic specifics of map construction. But advances in the latter will enhance the 
efficiency and applicability of the inference strategy developed here. 
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(a) True 9 = log(/t) field. 



(b) Posterior mean using multiscale approach. 
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Figure 6. Application of the multiscale inference framework to a 10816 -dimensional Bayesian inverse 
problem. 
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